In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
sns.set_context("talk")
sns.set_style("ticks")
np.random.seed(1337)
I will be writing about Extreme Value Theory (EVT), which was introduced to me by my brother Sudhanshu while he was working on his internship project. I really liked the connection it has with the central limit theorem (CLT). The approach allowed me to better understand the CLT as a way to identify the distribution of a function applied to independent and identically distributed (i.i.d.) samples: for the CLT, the function is the mean (or sum) of the samples; for EVT, it is the max or min.
I will do a multi-part series trying to explain EVT using examples from real data. In this first part, the focus is simply on understanding the core ideas and introducing the limiting distributions of EVT.
Most of us know about the central limit theorem. It states that the sample means of i.i.d. samples from a distribution with finite variance follow a normal distribution. More formally, we can write it as follows:
$$ \begin{equation} X_1, X_2, ..., X_n \sim p(X)\\ S_n = \frac{\sum_i X_i}{n}\\ S_n \sim \mathcal{N}(\mu, \sigma^2/n)\\ \end{equation} $$(approximately, for large $n$). Similarly, there exists Extreme Value Theory. This theory deals with the distribution of the extreme values (like the minimum or maximum) of i.i.d. samples from some distribution.
$$ \begin{equation} X_1, X_2, ..., X_n \sim p(X)\\ M_n = \max_i X_i\\ M_n \sim {\textrm {GEV}}(\mu ,\,\sigma ,\,\xi ) \\ \end{equation} $$Surprisingly, just like in the central limit theorem, these values also converge (after suitable rescaling) to a single family of distributions, the generalized extreme value (GEV) distribution, which comprises three types (a short derivation follows the list below). These distributions are:
Weibull law: $G(z)={\begin{cases}\exp \left\{-\left(-\left({\frac {z-b}{a}}\right)\right)^{\alpha }\right\}&z<b\\1&z\geq b\end{cases}}$ when the distribution of $M_{n}$ has a light tail with finite upper bound. Also known as Type 3.
Gumbel law: $G(z)=\exp \left\{-\exp \left(-\left({\frac {z-b}{a}}\right)\right)\right\}{\text{ for }}z\in \mathbb {R}$, when the distribution of $M_{n}$ has an exponential tail. Also known as Type 1.
Fréchet Law: $G(z)={\begin{cases}0&z\leq b\\\exp \left\{-\left({\frac {z-b}{a}}\right)^{-\alpha }\right\}&z>b.\end{cases}}$ when the distribution of $M_{n}$ has a heavy tail (including polynomial decay). Also known as Type 2.
In all cases, $\alpha >0$. [Source: https://en.wikipedia.org/wiki/Extreme_value_theory#Univariate_theory ]
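To see where these limiting laws come from, note that for i.i.d. samples the CDF of the maximum factorizes: $P(M_n \leq z) = P(X_1 \leq z, \dots, X_n \leq z) = F(z)^n$. As a concrete example of an exponential tail, take $X_i \sim \textrm{Exponential}(1)$ with $F(z) = 1 - e^{-z}$ and shift the maximum by $\log n$; the limit is exactly the Gumbel law:
$$ \begin{equation} P(M_n - \log n \leq z) = F(z + \log n)^n = \left(1 - \frac{e^{-z}}{n}\right)^{n} \xrightarrow{n\to\infty} \exp\left(-e^{-z}\right) \end{equation} $$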
The good thing about knowing the distribution of $M_n$ is that we can quantify the probability of observing a value as extreme as $M_n$. In the case of the maximum, this can be computed using the cumulative distribution function (CDF) as $P(M_n < x)$. If we set a threshold on this probability (say $\delta = 99\%$), then we can build systems which are robust to the extreme value $\delta$ percent of the time, and we also know that any observation whose CDF value exceeds $\delta$ is a very rare event.
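As a minimal sketch of this thresholding idea (the GEV parameters below are made up for illustration; in practice they would be fitted to observed maxima, as we do later), we can use scipy.stats.genextreme both ways: the CDF tells us how unusual an observed maximum is, and the percent point function (inverse CDF) gives the threshold that bounds $\delta$ of the maxima.
In [ ]:
# Hypothetical Gumbel-type GEV (c=0) with illustrative loc/scale values.
gev = stats.genextreme(c=0, loc=10, scale=2)
m_obs = 18.0
print(f"P(M_n < {m_obs}) = {gev.cdf(m_obs):.4f}")  # how unusual is this maximum?
delta = 0.99
print(f"{delta:.0%} of maxima fall below {gev.ppf(delta):.2f}")  # design threshold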
Let us see this in action. First we are going to create a draw_samples function which allows us to draw samples from various distributions available in scipy.stats. The function returns an $n \times k$ array: $n$ independent samples, each consisting of $k$ draws, and the aggregations below are applied across each sample of size $k$.
In [68]:
def draw_samples(dist, n, k=1):
    # Draw an (n, k) array: n independent samples, each consisting of k draws.
    return dist.rvs(size=(n,k))
def plot_samples(dists, aggregations, n=1000, k_max=10000):
    # For each distribution, plot the histogram (and cumulative histogram) of each
    # aggregation (e.g. mean/max/min) computed over the first k draws, for growing k.
    rows, cols = len(aggregations), len(dists)
    fig, ax = plt.subplots(rows, cols, figsize=(5*cols, 3*rows))
    fig_cum, ax_cum = plt.subplots(rows, cols, figsize=(5*cols, 3*rows))
    for i, (dist_name, dist) in enumerate(dists):
        samples = draw_samples(dist, n, k_max)
        dist_mean = dist.mean()
        dist_std = dist.std()
        for j, (agg_name, agg) in enumerate(aggregations):
            for k in [int(10**p) for p in np.arange(1, np.log10(k_max)+1)]:
                normed_samples = samples[:, :k]
                #normed_samples = (samples[:, :k] - dist_mean)/dist_std
                X = agg(normed_samples, axis=1)
                #print(dist_name, agg_name, k, X.shape)
                ax[j, i].hist(
                    X, bins=100, weights=np.ones_like(X)/X.shape[0],
                    cumulative=False,
                    histtype="step",
                    label=f"$k={k}$"
                )
                ax_cum[j, i].hist(
                    X, bins=100, weights=np.ones_like(X)/X.shape[0],
                    cumulative=True,
                    histtype="step",
                    label=f"$k={k}$"
                )
            if j == 0:
                ax[j, i].set_title(dist_name)
                ax_cum[j, i].set_title(dist_name)
            if i == 0:
                ax[j, i].set_ylabel(f"$p({agg_name}[X_k])$")
                ax_cum[j, i].set_ylabel(f"$p({agg_name}[X_k] < x)$")
            if j == rows-1 and i == cols-1:
                ax[j, i].legend(loc="best", bbox_to_anchor=(1.01, 0.5))
                ax_cum[j, i].legend(loc="best", bbox_to_anchor=(1.01, 0.5))
    sns.despine(fig=fig, offset=10)
    sns.despine(fig=fig_cum, offset=10)
    fig.tight_layout()
    fig_cum.tight_layout()
    return (fig, ax), (fig_cum, ax_cum)
In [69]:
samples = draw_samples(stats.norm(), 100, 10000)
samples.shape
Out[69]:
(100, 10000)
Let us compare the distribution of the mean, max, and min of samples from a few distributions, e.g. uniform, normal, exponential, logistic, and powerlaw. We are going to plot the PDF as well as the CDF of these distributions. The PDF tells us the neighborhood of high probability density, while the CDF gives us a way to bound a certain percentage of the data.
In [71]:
dists = [
    ("uniform", stats.uniform()),
    ("norm", stats.norm()),
    ("expon", stats.expon()),
    ("logistic", stats.logistic()),
    ("powerlaw", stats.powerlaw(a=5)),
]
aggregations = [
    ("mean", np.mean),
    ("max", np.max),
    ("min", np.min),
]
plot_samples(dists, aggregations, n=1000, k_max=10000);
As we can observe, as the number of samples $k$ increases, the estimates of the mean and of the extreme values converge. However, unlike the central limit theorem, where the distribution of the estimated mean stays centred on the true mean and only its variance shrinks for higher $k$, the extreme value distributions shift to the right (for max) or to the left (for min) as $k$ increases. The variance of the extreme value distributions also reduces as $k$ increases.
For distributions with a bounded support, like the uniform and powerlaw, the maximum converges to the upper bound as $k$ increases; likewise, for distributions bounded below (uniform, exponential, powerlaw) the minimum converges to the lower bound.
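As a quick numeric check of this behaviour, compare the maximum of $k$ draws from the (bounded) uniform distribution with the maximum of $k$ draws from the (unbounded) standard normal: the former creeps up to the bound $1$, while the latter keeps growing slowly.
In [ ]:
# Max of k draws: bounded support (uniform on [0, 1]) vs unbounded support (standard normal).
for k in [10, 100, 1000, 10000]:
    u_max = stats.uniform().rvs(size=k).max()
    n_max = stats.norm().rvs(size=k).max()
    print(f"k={k:>5}  max(uniform)={u_max:.4f}  max(norm)={n_max:.4f}")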
In [3]:
normal = stats.norm()
# scipy's genextreme uses a shape parameter c with the opposite sign of xi:
# c=0 is Gumbel, c<0 is Frechet (heavy tail), c>0 is (reversed) Weibull (bounded above).
gumbel = stats.genextreme(c=0)
frechet = stats.genextreme(c=-1)
weibull = stats.genextreme(c=1)
In [4]:
fig, ax = plt.subplots(1,2, sharex=True, sharey=True, figsize=(15, 6))
x = np.arange(-3, 5.001, 0.001)
ax[0].plot(x, normal.pdf(x), label="normal", lw=1)
ax[0].plot(x, gumbel.pdf(x), label="gumbel", lw=1)
ax[0].plot(x, frechet.pdf(x), label="frechet", lw=1)
ax[0].plot(x, weibull.pdf(x), label="weibull", lw=1)
ax[0].axvline(x=0, lw=0.5, linestyle="--", color="k")
ax[0].set_ylabel("$P(X=x)$")
ax[0].set_xlabel("$x$")
ax[1].plot(x, normal.cdf(x), label="normal", lw=1)
ax[1].plot(x, gumbel.cdf(x), label="gumbel", lw=1)
ax[1].plot(x, frechet.cdf(x), label="frechet", lw=1)
ax[1].plot(x, weibull.cdf(x), label="weibull", lw=1)
ax[1].axvline(x=0, lw=0.5, linestyle="--", color="k")
ax[1].axhline(y=0.9, lw=0.5, linestyle="--", color="k")
ax[1].set_ylabel(r"$P(X\leq x)$")
ax[1].set_xlabel("$x$")
lgd = fig.legend(
    *ax[1].get_legend_handles_labels(),
    bbox_to_anchor=(0.8, 1.05),
    bbox_transform=fig.transFigure,
    ncol=4
)
for lh in lgd.legendHandles:
    lh.set_linewidth(5)
sns.despine(offset=10)
fig.tight_layout()
In [5]:
iris = sns.load_dataset("iris")
iris.head()
Out[5]:
In [6]:
g = sns.pairplot(iris, hue="species", markers=["o", "s", "D"])
In [7]:
g = sns.pairplot(iris[["petal_length", "petal_width", "species"]], hue="species", markers=["o", "s", "D"])
In [8]:
X = iris[["petal_length", "petal_width"]]
y = iris["species"]
In [9]:
def get_known_mean(X, y, known_classes):
    # Mean feature vector of each "known" class.
    known_idx = y.isin(known_classes)
    known_mean = X[known_idx].groupby(y[known_idx]).mean()
    return known_mean

def plot_known_mean(X, y, known_mean):
    # Scatter the data coloured by class and mark the known class means with stars.
    ax = sns.scatterplot(data=X, x="petal_length", y="petal_width", hue=y, alpha=0.5)
    ax.plot(known_mean["petal_length"], known_mean["petal_width"], marker="*", linestyle="none", ms=20, color="k", alpha=0.9)
    sns.despine(offset=10)
    return ax
In [10]:
known_classes = ["setosa", "versicolor"]
known_mean = get_known_mean(X, y, known_classes)
known_mean
Out[10]:
In [11]:
plot_known_mean(X, y, known_mean)
Out[11]:
In [12]:
def get_errors(X, y, known_mean):
    # For each class, squared distance of every point to its *nearest* known class mean.
    errors = {
        k: pd.concat([
            np.square(X[y==k] - known_mean.loc[cls]).sum(axis=1)
            for cls in known_mean.index
        ], axis=1).min(axis=1)
        for k in y.unique()
    }
    return errors

def get_error_bins(errors, nbins=40):
    # Common histogram bins spanning the pooled errors of all classes.
    tmp = np.hstack(list(errors.values()))
    print(tmp.shape)
    bins = np.histogram(tmp, bins=nbins)[1]
    print(bins.min(), bins.max())
    return bins
In [13]:
errors = get_errors(X, y, known_mean)
bins = get_error_bins(errors, nbins=40)
In [14]:
errors["setosa"].shape
Out[14]:
(50,)
In [15]:
def fit_distributions(errors, known_mean):
    # Fit a GEV and a normal distribution to the errors of the known classes only.
    known_errors = pd.concat([errors[cls] for cls in known_mean.index])
    c, loc, scale = stats.genextreme.fit(
        known_errors,
        loc=known_errors.mean(),
        scale=known_errors.std()
    )
    print(f"c={c}, loc={loc}, scale={scale}")
    extreme_dist = stats.genextreme(c, loc, scale)
    print(extreme_dist.stats("mvsk"))
    loc, scale = stats.norm.fit(
        known_errors,
    )
    print(f"loc={loc}, scale={scale}")
    norm_dist = stats.norm(loc, scale)
    print(norm_dist.stats("mvsk"))
    return extreme_dist, norm_dist
In [16]:
extreme_dist, norm_dist = fit_distributions(errors, known_mean)
In [17]:
def plot_distributions(extreme_dist, norm_dist, bins):
    # Compare the fitted normal and extreme value distributions over the error range.
    fig, ax = plt.subplots(1, 2, sharex=True, sharey=False, figsize=(8, 3))
    lower = bins[0]
    #higher = max([extreme_dist.mean() + 2*extreme_dist.std(), norm_dist.mean() + 2*norm_dist.std()])
    higher = bins[-1]
    x = np.linspace(lower, higher, 1000)
    ax[0].plot(x, norm_dist.pdf(x), label="normal", lw=10, alpha=0.5)
    ax[0].plot(x, extreme_dist.pdf(x), label="extreme", lw=1)
    #ax[0].axvline(x=0, lw=0.5, linestyle="--", color="k")
    ax[0].set_ylabel("$P(X=x)$")
    ax[0].set_xlabel("$x$")
    ax[1].plot(x, norm_dist.cdf(x), label="normal", lw=10, alpha=0.5)
    ax[1].plot(x, extreme_dist.cdf(x), label="extreme", lw=1)
    #ax[1].axvline(x=0, lw=0.5, linestyle="--", color="k")
    ax[1].axhline(y=0.9, lw=0.5, linestyle="--", color="k")
    ax[1].set_ylabel(r"$P(X\leq x)$")
    ax[1].set_xlabel("$x$")
    lgd = fig.legend(
        *ax[1].get_legend_handles_labels(),
        bbox_to_anchor=(0.8, 1.05),
        bbox_transform=fig.transFigure,
        ncol=4
    )
    for lh in lgd.legendHandles:
        lh.set_linewidth(5)
    sns.despine(offset=10)
    fig.tight_layout()
In [18]:
plot_distributions(extreme_dist, norm_dist, bins)
In [19]:
def plot_error_cdf(errors, known_classes, extreme_dist, norm_dist, bins):
    # Empirical CDF of the errors per class, overlaid with the fitted GEV and normal CDFs.
    # Note: relies on the global `y` for the list of classes.
    all_classes = y.unique()
    ncols = min(4, len(all_classes))
    nrows = int(np.ceil(len(all_classes)/ncols))
    fig, ax = plt.subplots(
        nrows, ncols, sharex=True, sharey=True,
        figsize=(5*ncols, 4*nrows)
    )
    x = np.arange(bins[0], bins[-1]+0.01, 0.01)
    threshold_cdf = 0.9
    threshold_x = extreme_dist.ppf(threshold_cdf)
    for i, (axi, cls) in enumerate(zip(ax.flatten(), all_classes)):
        x_err = errors[cls]
        color = "red"
        if cls in known_classes:
            color = "green"
        axi.hist(
            x_err, bins, cumulative=True, density=True,
            #weights=np.ones_like(errors)/np.shape(errors)[0],
            edgecolor=color, linewidth=10, histtype="step", alpha=0.5
        )
        axi.plot(x, extreme_dist.cdf(x), lw=1, color="k", label="extreme")
        axi.plot(x, norm_dist.cdf(x), lw=10, color="b", alpha=0.1, label="normal")
        axi.axhline(y=threshold_cdf, lw=1, linestyle="--", color="0.5")
        axi.axvline(x=threshold_x, lw=1, linestyle="--", color="0.5")
        axi.set_title(f"Mean[{cls}]={np.mean(x_err):.4f}", fontsize=20)
        axi.legend(loc="best")
    sns.despine(offset=10)
    fig.tight_layout()
In [20]:
plot_error_cdf(errors, known_classes, extreme_dist, norm_dist, bins)
In [21]:
def plot_cdf_cdf(errors, known_classes, extreme_dist):
    # Histogram of the GEV CDF values of the errors per class: known classes (green)
    # should stay well below 1, unknown classes (red) pile up near 1, i.e. rare events.
    all_classes = y.unique()
    ncols = min(4, len(all_classes))
    nrows = int(np.ceil(len(all_classes)/ncols))
    fig, ax = plt.subplots(
        nrows, ncols, sharex=True, sharey=True,
        figsize=(5*ncols, 4*nrows)
    )
    bins = np.arange(0, 1.01, 0.01)
    for i, (axi, cls) in enumerate(zip(ax.flatten(), all_classes)):
        x_err = errors[cls]
        color = "red"
        if cls in known_classes:
            color = "green"
        errors_cdf = extreme_dist.cdf(x_err)
        axi.hist(
            errors_cdf, bins, cumulative=True, density=True,
            edgecolor=color, linewidth=10, histtype="step", alpha=0.5
        )
        axi.set_title(f"Mean[{cls}]={np.mean(errors_cdf):.4f}", fontsize=20)
    sns.despine(offset=10)
    fig.tight_layout()
In [22]:
plot_cdf_cdf(errors, known_classes, extreme_dist)
In [23]:
def scatter_known_unknown(X, y, errors, extreme_dist):
    # Colour each point by the GEV CDF of its error: values close to 1 flag likely "unknown" samples.
    fig = plt.figure()
    known_unknown = pd.concat(errors.values()).apply(extreme_dist.cdf)
    print((known_unknown > 0.95).value_counts())
    points = plt.scatter(
        data=X,
        x="petal_length",
        y="petal_width",
        c=known_unknown,
        s=40,
        cmap="viridis_r"
    )
    plt.colorbar(points)
    sns.despine(offset=10)

def scatter_error(X, y, errors):
    # Colour each point by its raw squared error to the nearest known class mean.
    fig = plt.figure()
    points = plt.scatter(
        data=X,
        x="petal_length",
        y="petal_width",
        c=pd.concat(errors.values()),
        s=40,
        cmap="viridis_r"
    )
    plt.colorbar(points)
    sns.despine(offset=10)
In [24]:
scatter_known_unknown(X, y, errors, extreme_dist)
scatter_error(X, y, errors)
In [25]:
known_classes = ["setosa", "virginica"]
known_mean = get_known_mean(X, y, known_classes)
known_mean
Out[25]:
In [26]:
plot_known_mean(X, y, known_mean)
errors = get_errors(X, y, known_mean)
bins = get_error_bins(errors, nbins=40)
extreme_dist, norm_dist = fit_distributions(errors, known_mean)
plot_distributions(extreme_dist, norm_dist, bins)
plot_error_cdf(errors, known_classes, extreme_dist, norm_dist, bins)
plot_cdf_cdf(errors, known_classes, extreme_dist)
scatter_known_unknown(X, y, errors, extreme_dist)
scatter_error(X, y, errors)
In [27]:
X = iris[["petal_length", "petal_width", "sepal_length", "sepal_width"]]
y = iris["species"]
In [28]:
known_classes = ["setosa", "virginica"]
known_mean = get_known_mean(X, y, known_classes)
known_mean
Out[28]:
In [29]:
errors = get_errors(X, y, known_mean)
bins = get_error_bins(errors, nbins=40)
extreme_dist, norm_dist = fit_distributions(errors, known_mean)
plot_distributions(extreme_dist, norm_dist, bins)
plot_error_cdf(errors, known_classes, extreme_dist, norm_dist, bins)
plot_cdf_cdf(errors, known_classes, extreme_dist)
In [30]:
from sklearn import datasets
In [31]:
faces = datasets.fetch_olivetti_faces()
In [32]:
print(faces.DESCR)
In [33]:
faces.data.shape
Out[33]:
(400, 4096)
In [34]:
faces.target.shape
Out[34]:
(400,)
In [35]:
faces = pd.DataFrame(faces.data, columns=[f"f{i}" for i in range(faces.data.shape[1])]).assign(label=faces.target)
faces.head()
Out[35]:
In [36]:
X = faces[faces.label < 16].filter(regex="^f")
y = faces[faces.label < 16]["label"]
X.shape, y.shape, y.unique().shape
Out[36]:
((160, 4096), (160,), (16,))
In [37]:
known_classes = list(range(10))
known_mean = get_known_mean(X, y, known_classes)
known_mean
Out[37]:
In [38]:
%time errors = get_errors(X, y, known_mean)
extreme_dist, norm_dist = fit_distributions(errors, known_mean)
In [39]:
bins = get_error_bins(errors, nbins=40)
%time plot_distributions(extreme_dist, norm_dist, bins)
%time plot_error_cdf(errors, known_classes, extreme_dist, norm_dist, bins)
%time plot_cdf_cdf(errors, known_classes, extreme_dist)
In [ ]: